Machine Learning-Driven Predictive Modeling for Opioid Use Disorder in Chronic Pain Patients on Long-Term Opioid Use
Overview
This predictive model aims to identify opioid use disorder among chronic pain patients on long-term opioid therapy by incorporating diverse risk factors, grounded in the biopsychosocial model of pain. It provides a robust predictive framework and serves as a foundation for more advanced machine learning models that will incorporate a broader range of psychiatric conditions, ultimately enhancing early identification and intervention strategies for at-risk patients. For this project, I am collaborating with two psychology experts specializing in substance use disorders, including opioid use disorder.
Link to final project GitHub repository: https://github.com/yoonjae-hub/BMIN503_Final_Project.git
1. Introduction
Nearly 51.6 million Americans live with chronic pain (Rikard, Strahan et al. 2023). Managing chronic non-cancer pain (CNCP) is particularly challenging because of its complex neurobiological and psychosocial mechanisms (Fillingim, Bruehl et al. 2014), and it profoundly impacts patients’ physical, psychological, and social well-being (Genova, Dix et al. 2020). CNCP is defined as persistent or recurrent pain lasting three months or longer and encompasses a range of non-malignant conditions, including neuropathic pain, rheumatoid arthritis, lower back pain, osteoarthritis, fibromyalgia, and various other conditions. A major concern with CNCP is the ongoing use of opioids. Opioids are widely prescribed for chronic pain but can cause adverse outcomes, such as unintentional overdose, misuse, addiction, and even death (Dahlhamer, Connor et al. 2021, Dowell, Ragan et al. 2022). As many as 25% of patients treated with opioids for chronic pain are known to develop opioid use disorder (OUD) (Cordier Scott and Lewis 2016).
Leveraging AI in healthcare can transform OUD management by analyzing large datasets to develop machine learning (ML) algorithms that identify individuals at risk. This approach enables earlier intervention and more personalized care. Recently, the FDA approved AvertD, an ML-based risk score model designed to assess genetic predisposition to OUD using a prescription-only genotyping test (U.S. Food and Drug Administration 2024). However, AvertD relies solely on buccal swab specimens, excluding significant socioeconomic, physical, and psychological factors associated with OUD. Additionally, AvertD is designed for first-time opioid users, while OUD incidence is higher among individuals on long-term opioid therapy. This highlights the need for an ML model that targets chronic opioid users and integrates broader risk factors for better accessibility and applicability. Therefore, this study aims to develop an ML model to predict OUD among chronic pain patients who have been on opioid therapy, integrating biopsychosocial factors that are known to influence OUD risk.
This problem is inherently interdisciplinary because pain, particularly in the context of chronic pain and opioid use disorder (OUD), cannot be fully understood or addressed from a single disciplinary perspective. The biopsychosocial model emphasizes that pain is influenced not only by biological factors but also by psychological and social dimensions. Therefore, incorporating experts from psychology is essential to explore the psychological underpinnings of pain, such as emotional distress, coping mechanisms, and the role of comorbid psychiatric conditions like anxiety, depression, or substance use disorders.
2. Methods
Data source This project used a dataset collected for a larger parent study that evaluated the phenotypic and genotypic profiles of CNCP patients on opioid therapy who did and did not develop an OUD. Data were collected between November 2012 and September 2018. The study includes 1,356 patients who were receiving long-term (6 months or more) opioid therapy (LTOT) to manage CNCP. Subjects were recruited from three regions of the United States (Pennsylvania, Washington, and Utah). The eligibility criteria for the parent study were: individuals who were (1) aged 18 or older, (2) Caucasian and of European descent (defined as having 3 or 4 grandparents of European descent), (3) experiencing CNCP of musculoskeletal or neuropathic origin persisting for at least 6 months, (4) without a history of substance abuse (except nicotine) before starting LTOT, and (5) having received LTOT to treat their pain for at least the past 6 months. Because of the genotypic and phenotypic objectives of the parent study, individuals who were non-Caucasian or had a previous history of substance use disorder were excluded. Additional exclusion criteria included severe psychiatric conditions that hindered the ability to provide informed consent or complete the questionnaire, and pain conditions not arising from a musculoskeletal/neuropathic origin (e.g., cancer, gynecologic issues, abdominal discomfort, visceral concerns, dental problems, neuropathic pain due to metabolic disease).
Determination of OUD For this analysis, participants were grouped as either ‘cases’ or ‘controls’ based on whether or not they developed OUD after initiating LTOT. The control group included patients who did not have evidence of opioid abuse or meet the criteria for OUD at the time of assessment. Their electronic health records (EHR) were reviewed monthly for 12 consecutive months after enrolling in the study to ensure they did not develop an OUD after completing the baseline assessment. To be considered controls, patients receiving LTOT needed to have negative urine drug screens (which detect opioid metabolites and other illicit drugs), have no record of current or past SUD, and have evidence of no aberrant drug-related behaviors assessed using an expert developed checklist (Cheatle, O’Brien et al. 2013). Cases were patients who did not have a history of SUD (except nicotine) prior to beginning LTOT and currently met DSM-IV criteria for ‘opioid dependence’, which were obtained at the time of enrolling in the study.
Feature Selection Given the biopsychosocial aspects of OUD, the initial analysis will explore variables such as demographic factors (age, sex, race, ethnicity, marital status, education, employment, financial situation), pain (severity, interference), nicotine dependence, psychiatric conditions (depression, anxiety, pain catastrophizing, mental defeat, suicidality), and social support. To check for multicollinearity among these variables, the variance inflation factor (VIF) was used.
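As a minimal sketch of the VIF screen (toy data; the variable names below are placeholders, not the study's actual columns), `car::vif()` can be applied to a fitted logistic model:

```r
library(car) # provides vif()

# Toy data frame with independent predictors; names are illustrative only
set.seed(1)
dat <- data.frame(
  oud  = rbinom(200, 1, 0.3),
  age  = rnorm(200, 50, 10),
  pain = rnorm(200, 20, 5),
  anx  = rnorm(200, 4, 2)
)

fit <- glm(oud ~ age + pain + anx, data = dat, family = binomial)
vif(fit) # values well above ~5-10 would flag problematic multicollinearity
```

Because the toy predictors are generated independently, each VIF here should sit near 1.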
Implementation of ML This study will use a supervised learning approach to develop the predictive model. Given the binary outcome, algorithms such as logistic regression, random forest, and gradient boosting will be employed. To compare performance, each algorithm will be evaluated using the area under the receiver operating characteristic curve (AUROC). A potential challenge in this approach is missing data, as some variables rely on total scores: a single missing item can invalidate an entire scale score, producing more extensive data gaps than item-level missingness alone. To address this, missing data imputation techniques will be considered. Also, to assess model generalization and detect potential overfitting/underfitting, cross-validation methods (k-fold cross-validation) will be applied.
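For illustration, k-fold cross-validated AUROC for a logistic model can be computed with pROC on toy data (all names below are placeholders, not the study's variables):

```r
library(pROC)

# Simulated binary-outcome data; x1/x2/y are illustrative only
set.seed(42)
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.8 * dat$x1 - 0.5 * dat$x2))

k     <- 5
folds <- sample(rep(1:k, length.out = n)) # random fold assignment
aucs  <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  aucs[i] <- as.numeric(auc(roc(test$y, prob, quiet = TRUE)))
}

mean(aucs) # cross-validated AUROC estimate
```

The same fold loop generalizes to random forest and gradient boosting by swapping the model-fitting and prediction lines.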
2.1 Loading Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # For graphs
library(lubridate) # For manipulating dates
library(gtsummary) # Summary statistics
library(RColorBrewer) # For coloring the plots
library(cowplot) # For combining multiple plots into one
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
stamp
library(ggpubr) # Combining multiple graphs into one figure
Attaching package: 'ggpubr'
The following object is masked from 'package:cowplot':
get_legend
library(pROC) # For ROC curves and AUC
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
library(dplyr) # For data cleaning
library(gtsummary)
library(haven)
library(rsample)
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
# Predictors
# Demographic - gender (DEM002), age (DEM003), marital status (DEM004), living alone (DEM005), education (DEM006), employment (DEM007), financial status (DEM008)
# Please note that due to the genotypic and phenotypic objectives of the parent study, persons who are non-Caucasian or with a previous history of substance use disorder were excluded from the study
demp_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/demp_v2.sav")
# Pain - severity, interference - Brief Pain Inventory
bpip <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/bpip.sav")
# Smoking - Fagerstrom scale
fnts <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/fnts.sav")
# Alcohol
aadidm <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/aadidm.sav")
# Anxiety/Depression - PHQ-4
phq4 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/phq4.sav")
# Pain catastrophizing (subscale) - Coping strategies questionnaire
copsq <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/copsq.sav")
# Mental defeat - Pain perception scale
pps <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pps.sav")
# Suicidality - SBQ-R
sbqr <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/sbqr.sav")
# Social support - Duke social support index
dssi_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/dssi_v2.sav")
# Outcome - Opioid Use Disorder
pevc <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pevc.sav")
[1] "Live alone" "One other person"
[3] "More than one other person"
demo$DEM006[which(demo$DEM006 == "(X)Subject refused to answer")] <- NA
demo$DEM006 <- droplevels(demo$DEM006)
levels(demo$DEM006)
[1] "Grade 6 or less"
[2] "Grade 7 to 12"
[3] "Graduated high school or high school equivalent"
[4] "Part college"
[5] "Graduated 2 years college"
[6] "Graduated 4 years college"
[7] "Part graduate/professional school"
[8] "Completed graduate/professional school"
demo$DEM007 <- as.character(demo$DEM007)
demo$DEM007[which(demo$DEM007 == "Do not know/refused")] <- NA
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM007 <- droplevels(demo$DEM007)
levels(demo$DEM007)
[1] "Yes" "No"
demo$DEM008 <- as.character(demo$DEM008)
demo$DEM008[which(demo$DEM008 == "Do not know/Refused")] <- NA
demo$DEM008 <- as_factor(demo$DEM008)
demo$DEM008 <- droplevels(demo$DEM008)
levels(demo$DEM008)
[1] "Ca not make ends meet" "Have just enough to get along"
[3] "Are comfortable"
# Transforming to factor (all, except DEM003 - age)
demo$DEM002 <- as_factor(demo$DEM002)
demo$DEM003 <- as.numeric(demo$DEM003)
demo$DEM004 <- as_factor(demo$DEM004)
demo$DEM005 <- as_factor(demo$DEM005)
demo$DEM006 <- as_factor(demo$DEM006)
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM008 <- as_factor(demo$DEM008)
# Checking (examples)
class(demo$DEM004) # Should show "factor"
[1] "factor"
levels(demo$DEM005) # no 'do not know/refused'
[1] "Live alone" "One other person"
[3] "More than one other person"
is.numeric(demo$DEM005) # Should return FALSE
[1] FALSE
# Pain
# Pain severity
pain_severity <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_severity_sum = if_else(
    if_any(c(BPI001, BPI002, BPI003, BPI004), ~ . == -9),
    NA_real_,
    BPI001 + BPI002 + BPI003 + BPI004)) %>%
  select(pain_severity_sum, SubjID)

# Pain interference
pain_interf <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_interf_sum = if_else(
    if_any(c(BPI007A, BPI007B, BPI007C, BPI007D, BPI007E, BPI007G), ~ . == -9),
    NA_real_, # changing -9 to NA value
    BPI007A + BPI007B + BPI007C + BPI007D + BPI007E + BPI007G)) %>%
  select(pain_interf_sum, SubjID)

# Smoking: conditional survey (if FNTS000 entry question is 0, the remaining questions are system missing; no missing values in FNTS000)
missing_fnts <- sum(is.na(fnts$FNTS000))
missing_fnts # for the initial screening question, there is no missing data
[1] 0
# 001, 004: multiple-choice items (0-3) / 002, 003, 005, 006: yes/no items (0-1)
fnts_cleaned <- fnts %>%
  mutate(fnts_sum = ifelse(FNTS000 == 0, NA,
                           FNTS001 + FNTS004 + FNTS002 + FNTS003 + FNTS005 + FNTS006)) %>%
  replace_na(list(fnts_sum = 0))
fnts_cleaned <- fnts_cleaned %>% filter(month == 0)
fnts_cleaned <- fnts_cleaned %>%
  mutate(dependence_level = case_when(
    fnts_sum == 0                 ~ "Non-smoker",
    fnts_sum >= 1 & fnts_sum <= 2 ~ "Low",
    fnts_sum >= 3 & fnts_sum <= 4 ~ "Low to Moderate",
    fnts_sum >= 5 & fnts_sum <= 7 ~ "Moderate",
    fnts_sum >= 8                 ~ "High")) %>%
  select(SubjID, fnts_sum, dependence_level)
fnts_cleaned_2 <- fnts_cleaned %>% select(SubjID, dependence_level) # keep only two variables

# Alcohol use - more conservative, in the last 3 months
aadidm_cleaned <- aadidm %>%
  filter(month == 0) %>%
  mutate(alcohol = ifelse(AAD001 == 1, 1, 0),
         alcohol = replace(alcohol, is.na(alcohol), 0)) %>% # replace NA with 0
  select(SubjID, alcohol)
table(aadidm_cleaned$alcohol)
# Suicidality: cutoff >= 8 (higher risk) for clinical samples / no missing values, but '-9' = refused to answer
# -> total score is NA whenever a -9 value is present in any of the specified columns (NA in total score = only 6 records)
# Scoring website: https://www.ncbi.nlm.nih.gov/books/NBK571017/box/ch3.b11/?report=objectonly
sbqr_cleaned <- sbqr %>%
  filter(month == 0) %>%
  mutate(
    SBQ001 = case_when(
      SBQ001 %in% c(3, 4) ~ 3, # regrouping
      SBQ001 %in% c(5, 6) ~ 4,
      TRUE ~ SBQ001),
    SBQ003 = case_when(
      SBQ003 %in% c(2, 3) ~ 2,
      SBQ003 %in% c(4, 5) ~ 3,
      TRUE ~ SBQ003))
sbqr_cleaned <- sbqr_cleaned %>%
  mutate(sbqr_sum = if_else(
    rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004) == -9) > 0,
    NA_real_,
    rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004), na.rm = TRUE))) %>%
  select(SubjID, sbqr_sum)
# Categorize suicidality risk based on sbqr_sum (>= 8: high risk, < 8: low risk, NA stays NA); missing data = 6
sbqr_cleaned <- sbqr_cleaned %>%
  mutate(suicidality_risk = case_when(
    sbqr_sum >= 8 ~ "High Risk",
    sbqr_sum < 8 ~ "Low Risk",
    is.na(sbqr_sum) ~ NA_character_))
suicidality <- sbqr_cleaned %>% select(SubjID, suicidality_risk)

# Social support (DSSI): 3 subscale scores
dssi_cleaned <- dssi_v2 %>%
  filter(month == 0) %>%
  select(SubjID, SIDSSI, SSDSSI, ISDSSI)
# Social interaction (SIS, 4 items): amount of time; 3-point Likert scale (missing data: none)
## dssi_cleaned$SIDSSI
# Subjective social support (SSS, 7 items): depth or quality of relationships; 3-point (missing data: 14)
## dssi_cleaned$SSDSSI
# Instrumental social support (ISS, 12 items): degree to which others can be counted upon with daily activities; 0/1 (missing data: 4)
## dssi_cleaned$ISDSSI

# Outcome: opioid use disorder (cohort - 1: case = OUD, 0: control = no OUD)
cohort <- read.csv("/Users/yoonjaelee/Desktop/Dr.CHEATLE ML PROJECT/pevc_compiled.csv")
cohort_cleaned <- cohort %>% select(SubjID, cohort)
cohort_cleaned$cohort <- as.factor(cohort_cleaned$cohort) # changing the class to categorical
class(cohort_cleaned$cohort) # factor
## Merging data using left_join
merging1 <- left_join(cohort_cleaned, demo, by = "SubjID")
## Note: when merging the demo file with cohort_cleaned, there are missing data -> removed; not registered for the study (removed 173 patients)
merging1 <- merging1 %>% drop_na()
merging2 <- left_join(merging1, pain_severity, by = "SubjID")
merging3 <- left_join(merging2, pain_interf, by = "SubjID")
merging4 <- left_join(merging3, fnts_cleaned_2, by = "SubjID")
merging5 <- left_join(merging4, phq4_cleaned, by = "SubjID")
merging6 <- left_join(merging5, ca_catastro, by = "SubjID")
merging7 <- left_join(merging6, pps_cleaned, by = "SubjID")
merging8 <- left_join(merging7, suicidality, by = "SubjID")
merging9 <- left_join(merging8, aadidm_cleaned, by = "SubjID")
final_data <- left_join(merging9, dssi_cleaned, by = "SubjID") # final sample N = 1331
table(final_data$cohort) # 901/430
0 1
901 430
2.4 Exploratory Data Analysis (Distribution/Association)
Based on the distribution analysis, most variables were skewed. However, given the large sample size, we can rely on the Central Limit Theorem to approximate normality.
# Categorical variables
# 1-1) FNTS - Visualization
fnts_eda <- left_join(cohort_cleaned, fnts_cleaned, by = "SubjID") %>%
  filter(!is.na(dependence_level), !is.na(cohort))
ggplot(fnts_eda, aes(x = dependence_level, fill = as.factor(cohort))) +
  geom_bar(position = "dodge") +
  labs(x = "Dependence Level", y = "Count", fill = "OUD Status") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()
oud_smoking <- table(final_data$cohort, final_data$dependence_level)
print(oud_smoking) # all cells > 5, okay to perform chi-square test
High Low Low to Moderate Moderate Non-smoker
0 20 54 47 59 694
1 43 49 75 123 109
F test to compare two variances
data: ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
F = 0.7723, num df = 875, denom df = 399, p-value = 0.002074
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.6513741 0.9108211
sample estimates:
ratio of variances
0.7723014
Welch Two Sample t-test
data: ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
t = 5.222, df = 690.86, p-value = 2.345e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.583093 3.490811
sample estimates:
mean of x mean of y
21.72945 19.19250
# 2-1) Pain interference - Visualization
pi_data_noNA <- final_data %>% filter(!is.na(cohort), !is.na(pain_interf_sum))
ggplot(pi_data_noNA, aes(x = as.factor(cohort), y = pain_interf_sum, fill = as.factor(cohort))) +
  geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(x = "Cohort", y = "Pain Interference Sum", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  geom_jitter(height = 0, width = 0.1) +
  theme_minimal()
ggplot(pi_data_noNA, aes(x = pain_interf_sum, fill = as.factor(cohort))) +
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "Pain Interference Sum", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()
F test to compare two variances
data: pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
F = 0.9773, num df = 826, denom df = 390, p-value = 0.7834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8219306 1.1559193
sample estimates:
ratio of variances
0.9773002
Two Sample t-test
data: pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
t = 2.168, df = 1216, p-value = 0.03035
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.194941 3.906501
sample estimates:
mean of x mean of y
33.58525 31.53453
# 3-1) Pain catastrophizing - Distribution
pc_data_noNA <- final_data %>% filter(!is.na(cohort), !is.na(ca_sum))
ggplot(pc_data_noNA, aes(x = as.factor(cohort), y = ca_sum, fill = as.factor(cohort))) +
  geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(x = "Cohort", y = "Pain Catastrophizing Sum", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  geom_jitter(height = 0, width = 0.1) +
  theme_minimal()
ggplot(pc_data_noNA, aes(x = ca_sum, fill = as.factor(cohort))) +
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "Pain Catastrophizing", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()
F test to compare two variances
data: pc.nonoud$ca_sum and pc.oud$ca_sum
F = 1.2806, num df = 836, denom df = 366, p-value = 0.006322
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.072940 1.519366
sample estimates:
ratio of variances
1.28062
Welch Two Sample t-test
data: pc.nonoud$ca_sum and pc.oud$ca_sum
t = -7.9363, df = 784.18, p-value = 7.194e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.699746 -2.835854
sample estimates:
mean of x mean of y
14.26762 18.03542
# 4-1) PSPS - Mental defeat - Distribution
psps_data_noNA <- final_data %>% filter(!is.na(pps_sum))
ggplot(psps_data_noNA, aes(x = as.factor(cohort), y = pps_sum, fill = as.factor(cohort))) +
  geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(x = "Cohort", y = "Mental defeat", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  geom_jitter(height = 0, width = 0.1) +
  theme_minimal()
ggplot(psps_data_noNA, aes(x = pps_sum, fill = as.factor(cohort))) + # pps_sum, not ca_sum, matches the "Mental Defeat" axis label
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "Mental Defeat", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()
F test to compare two variances
data: md.nonoud$pps_sum and md.oud$pps_sum
F = 1.1209, num df = 582, denom df = 225, p-value = 0.3163
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.896521 1.386564
sample estimates:
ratio of variances
1.120889
Two Sample t-test
data: md.nonoud$pps_sum and md.oud$pps_sum
t = -6.6498, df = 807, p-value = 5.407e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-18.371066 -9.997216
sample estimates:
mean of x mean of y
31.39108 45.57522
# 5-1) DSSI - Distribution (SIDSSI, SSDSSI, ISDSSI; all continuous)
dssi_data_noNA <- final_data %>% filter(!is.na(SIDSSI), !is.na(SSDSSI), !is.na(ISDSSI))
ggplot(dssi_data_noNA, aes(x = SIDSSI, fill = as.factor(cohort))) +
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "SIDSSI", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()

ggplot(dssi_data_noNA, aes(x = SSDSSI, fill = as.factor(cohort))) +
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "SSDSSI", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()

ggplot(dssi_data_noNA, aes(x = ISDSSI, fill = as.factor(cohort))) +
  geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.3) +
  facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
  labs(title = "Histogram with Density Curve", x = "ISDSSI", fill = "Cohort") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()
F test to compare two variances
data: sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
F = 0.94976, num df = 838, denom df = 375, p-value = 0.5488
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7970469 1.1252390
sample estimates:
ratio of variances
0.9497572
sidssi_ttest <- t.test(sidssi.nonoud$SIDSSI, sidssi.oud$SIDSSI, var.equal = TRUE)
print(sidssi_ttest) # statistically NOT different
Two Sample t-test
data: sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
t = 0.2859, df = 1213, p-value = 0.775
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2432314 0.3262139
sample estimates:
mean of x mean of y
5.193087 5.151596
F test to compare two variances
data: ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
F = 0.79046, num df = 838, denom df = 375, p-value = 0.006448
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.6633604 0.9365058
sample estimates:
ratio of variances
0.7904571
Welch Two Sample t-test
data: ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
t = 9.7284, df = 651.18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.724496 2.596701
sample estimates:
mean of x mean of y
17.94517 15.78457
F test to compare two variances
data: isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
F = 0.67218, num df = 838, denom df = 375, p-value = 3.597e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5641025 0.7963774
sample estimates:
ratio of variances
0.6721818
Welch Two Sample t-test
data: isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
t = 3.294, df = 610.18, p-value = 0.001045
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2609288 1.0314215
sample estimates:
mean of x mean of y
8.702026 8.055851
2.4.1 Descriptive Analysis
Sample characteristics
Gender distribution is relatively even between groups. Individuals with OUD are generally younger. Being married is more common among non-OUD individuals, who also tend to have higher levels of education. Financial stability is less common among individuals with OUD.
library(Hmisc)
Attaching package: 'Hmisc'
The following object is masked from 'package:parsnip':
translate
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(table1)
Attaching package: 'table1'
The following objects are masked from 'package:Hmisc':
label, label<-, units
The following objects are masked from 'package:base':
units, units<-
# Preprocessing: filter out rows where "Refused to answer" is present
final_data_clean <- final_data[!(final_data$DEM002 == "(X)Subject refused to answer" |
                                 final_data$DEM004 == "(X)Subject refused to answer"), ]
final_data_clean <- droplevels(final_data_clean)
# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))
# Assigning labels to variables for better readability
label(final_data_clean$DEM002) <- "Gender"
label(final_data_clean$DEM003) <- "Age (years)"
label(final_data_clean$DEM004) <- "Marital Status"
label(final_data_clean$DEM005) <- "Household"
label(final_data_clean$DEM006) <- "Education"
label(final_data_clean$DEM007) <- "Employment"
label(final_data_clean$DEM008) <- "Financial Status"
# Print demographic characteristics
cat("Demographic characteristics in non-OUD and OUD groups")
Demographic characteristics in non-OUD and OUD groups
Predictors - descriptive analysis (not working due to missingness)
# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))
# Assigning labels to variables for better readability
label(final_data_clean$pain_severity_sum) <- "Pain severity"
label(final_data_clean$pain_interf_sum) <- "Pain interference"
label(final_data_clean$dependence_level) <- "Nicotine dependence"
label(final_data_clean$Anxiety) <- "Anxiety"
label(final_data_clean$Depression) <- "Depression"
label(final_data_clean$ca_sum) <- "Pain Catastrophizing"
label(final_data_clean$pps_sum) <- "Mental defeat"
label(final_data_clean$suicidality_risk) <- "Suicidality"
label(final_data_clean$SIDSSI) <- "Social interaction"
label(final_data_clean$SSDSSI) <- "Satisfaction with social support"
label(final_data_clean$ISDSSI) <- "Social support"
# Print clinical and psychosocial characteristics
cat("Clinical and Psychosocial Characteristics Stratified by OUD Status")
Clinical and Psychosocial Characteristics Stratified by OUD Status
2.5 Developing ML models
Given the skewness of the dataset, three algorithms were selected for their effectiveness in handling skewed variable distributions: Logistic Regression, Random Forest, and XGBoost.
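As an illustrative sketch of one of these learners (toy data with placeholder names, not the study dataset), a random forest classifier reports out-of-bag performance and variable importance directly:

```r
library(randomForest)

# Simulated binary classification problem; x1/x2/y are illustrative only
set.seed(3)
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(rbinom(n, 1, plogis(1.2 * dat$x1)))

rf <- randomForest(y ~ x1 + x2, data = dat, ntree = 200)
rf$confusion   # out-of-bag confusion matrix (rows = true class)
importance(rf) # mean decrease in Gini per predictor
```

Because the forest's out-of-bag estimates are computed on trees that did not see each observation, they give a built-in check that complements the k-fold cross-validation described above.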
2.5.1 Missing data Imputation
Before developing models, I need to address missing data. For this, random forest imputation was utilized. This method accounts for the distribution of variables and imputes values accordingly, rather than relying on mean, median, or mode. For continuous predictors, the imputed value is calculated as the weighted average of non-missing observations, with weights determined by proximities. For categorical predictors, the imputed value corresponds to the category with the highest average proximity. While most variables had less than 10% missing data, the mental defeat score (pps_sum) and suicidality risk exhibited significant levels of missingness. However, these variables are clinically important factors associated with the outcome (OUD). As such, they were included in the prediction model.
To evaluate whether these factors can be reliably incorporated and whether the imputation was performed appropriately, I will compare models with and without these variables. This analysis will be discussed in the section titled “Model Selection Despite Missing Data Imputation.”
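A minimal sketch of that planned comparison, on toy data where `pps_sum` merely stands in for the high-missingness predictor (the real comparison would use the imputed study data and cross-validated estimates):

```r
library(pROC)

# Simulated data; pps_sum here is a placeholder for the high-missingness variable
set.seed(7)
n   <- 400
dat <- data.frame(x = rnorm(n))
dat$pps_sum <- 0.6 * dat$x + rnorm(n)
dat$y <- rbinom(n, 1, plogis(0.7 * dat$x + 0.5 * dat$pps_sum))

# Fit the model with and without the candidate predictor
fit_without <- glm(y ~ x,           data = dat, family = binomial)
fit_with    <- glm(y ~ x + pps_sum, data = dat, family = binomial)

auc_without <- as.numeric(auc(roc(dat$y, predict(fit_without, type = "response"), quiet = TRUE)))
auc_with    <- as.numeric(auc(roc(dat$y, predict(fit_with,    type = "response"), quiet = TRUE)))
c(without = auc_without, with = auc_with)
```

If the AUROC with the imputed variable is no better (or worse) than without it, that would argue for dropping the variable rather than trusting the imputation.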
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
## Missing data count in the merged data
colSums(is.na(final_data))

# Missing data imputation, using the randomForest package
str(final_data)
'data.frame': 1331 obs. of 21 variables:
$ SubjID : num 1000 1001 1002 1003 1004 ...
$ cohort : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ DEM002 : Factor w/ 3 levels "(X)Subject refused to answer",..: 3 3 2 3 3 3 2 2 3 2 ...
..- attr(*, "label")= chr "What is your gender"
$ DEM003 : num 35 32 28 42 27 39 24 42 39 39 ...
$ DEM004 : Factor w/ 6 levels "(X)Subject refused to answer",..: 2 2 4 2 2 2 5 2 4 2 ...
..- attr(*, "label")= chr "Current marital status"
$ DEM005 : Factor w/ 3 levels "Live alone","One other person",..: 1 2 1 2 2 2 1 2 1 3 ...
$ DEM006 : Factor w/ 8 levels "Grade 6 or less",..: 1 4 3 5 6 7 6 2 5 3 ...
$ DEM007 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 1 2 ...
$ DEM008 : Factor w/ 3 levels "Ca not make ends meet",..: 1 2 1 2 3 2 2 3 3 1 ...
$ pain_severity_sum: num NA 32 25 12 13 20 22 23 22 25 ...
$ pain_interf_sum : num NA 48 50 3 15 39 54 14 55 35 ...
$ dependence_level : chr "Moderate" "Non-smoker" "Non-smoker" "Non-smoker" ...
$ Anxiety : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 1 ...
$ Depression : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 2 1 ...
$ ca_sum : num 22 9 18 5 12 16 18 0 30 21 ...
$ pps_sum : num NA NA NA 10 44 NA 26 NA NA NA ...
$ suicidality_risk : chr NA NA NA NA ...
$ alcohol : num 0 NA NA NA NA NA 1 0 1 0 ...
$ SIDSSI : num NA 5 4 10 3 4 5 9 5 9 ...
..- attr(*, "label")= chr "Total score for Social Interaction"
..- attr(*, "format.spss")= chr "F8.2"
$ SSDSSI : num NA 13 9 21 21 17 15 21 10 21 ...
..- attr(*, "label")= chr "Total score for Subjective Support"
..- attr(*, "format.spss")= chr "F8.2"
$ ISDSSI : num NA 8 10 6 12 9 10 9 0 11 ...
..- attr(*, "label")= chr "Total score for Instrumental Support"
..- attr(*, "format.spss")= chr "F8.2"
# Using rfImpute to impute missing values in predictor variables
final_data <- final_data %>%
  mutate(across(where(is.character), as.factor))
imputed_data <- rfImpute(cohort ~ ., data = final_data, iter = 5, ntree = 500)
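For intuition, `rfImpute` iteratively refines missing values using random-forest proximities. A much simpler baseline, shown below for contrast only (this is not what `rfImpute` does), is median/mode imputation; the toy data frame is hypothetical:

```r
# Simple median/mode baseline imputation (for comparison only; rfImpute instead
# iterates proximity-weighted averages from a random forest)
toy <- data.frame(score = c(10, NA, 14, 12),
                  group = factor(c("a", "b", NA, "b")))
toy$score[is.na(toy$score)] <- median(toy$score, na.rm = TRUE)     # numeric: median
toy$group[is.na(toy$group)] <- names(which.max(table(toy$group)))  # factor: mode
toy
```

Unlike this baseline, the proximity-based approach lets the imputed values depend on each observation's similarity to complete cases.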
cohort DEM002 DEM003 DEM004 DEM005
1 0 Female 35 Married/Partnered Live alone
2 0 Female 32 Married/Partnered One other person
3 0 Male 28 Divorced Live alone
4 0 Female 42 Married/Partnered One other person
5 0 Female 27 Married/Partnered One other person
6 0 Female 39 Married/Partnered One other person
DEM006 DEM007
1 Grade 6 or less Yes
2 Part college No
3 Graduated high school or high school equivalent No
4 Graduated 2 years college No
5 Graduated 4 years college Yes
6 Part graduate/professional school No
DEM008 pain_severity_sum pain_interf_sum
1 Ca not make ends meet 22.77192 36.78
2 Have just enough to get along 32.00000 48.00
3 Ca not make ends meet 25.00000 50.00
4 Have just enough to get along 12.00000 3.00
5 Are comfortable 13.00000 15.00
6 Have just enough to get along 20.00000 39.00
dependence_level Anxiety Depression ca_sum pps_sum suicidality_risk
1 Moderate 1 1 22 33.00521 Low Risk
2 Non-smoker 1 1 9 33.80834 Low Risk
3 Non-smoker 1 1 18 37.47518 Low Risk
4 Non-smoker 1 0 5 10.00000 Low Risk
5 Non-smoker 1 1 12 44.00000 Low Risk
6 Non-smoker 1 1 16 30.35042 Low Risk
alcohol SIDSSI SSDSSI ISDSSI
1 0.0000000 5.135818 18.11672 8.897234
2 0.4496675 5.000000 13.00000 8.000000
3 0.3850812 4.000000 9.00000 10.000000
4 0.4425485 10.000000 21.00000 6.000000
5 0.5076271 3.000000 21.00000 12.000000
6 0.4504239 4.000000 17.00000 9.000000
cohort DEM002 DEM003
0:901 (X)Subject refused to answer: 0 Min. : 2.00
1:430 Male :530 1st Qu.:23.00
Female :801 Median :35.00
Mean :33.92
3rd Qu.:44.00
Max. :69.00
DEM004 DEM005
(X)Subject refused to answer: 0 Live alone :229
Married/Partnered :637 One other person :523
Separated : 61 More than one other person:579
Divorced :231
Never married :355
Widowed : 47
DEM006 DEM007
Graduated high school or high school equivalent:350 Yes:350
Part college :336 No :981
Graduated 4 years college :173
Graduated 2 years college :149
Completed graduate/professional school :141
Grade 7 to 12 :126
(Other) : 56
DEM008 pain_severity_sum pain_interf_sum
Ca not make ends meet :392 Min. : 0.00 Min. : 0.00
Have just enough to get along:595 1st Qu.:16.55 1st Qu.:23.00
Are comfortable :344 Median :21.00 Median :34.00
Mean :20.87 Mean :32.76
3rd Qu.:26.00 3rd Qu.:44.00
Max. :40.00 Max. :60.00
dependence_level Anxiety Depression ca_sum pps_sum
High : 69 0:652 0:708 Min. : 0.00 Min. : 0.00
Low :107 1:679 1:623 1st Qu.:10.00 1st Qu.:17.41
Low to Moderate:122 Median :15.00 Median :30.37
Moderate :196 Mean :15.42 Mean :35.32
Non-smoker :837 3rd Qu.:21.00 3rd Qu.:53.08
Max. :36.00 Max. :96.00
suicidality_risk alcohol SIDSSI SSDSSI
High Risk: 330 Min. :0.0000 Min. : 0.000 Min. : 7.00
Low Risk :1001 1st Qu.:0.0000 1st Qu.: 4.000 1st Qu.:15.00
Median :0.0000 Median : 5.000 Median :18.00
Mean :0.3911 Mean : 5.158 Mean :17.27
3rd Qu.:1.0000 3rd Qu.: 7.000 3rd Qu.:20.00
Max. :1.0000 Max. :13.000 Max. :21.00
ISDSSI
Min. : 0.00
1st Qu.: 7.00
Median : 9.00
Mean : 8.48
3rd Qu.:11.00
Max. :12.00
2.5.2. Data Splitting and Cross-Validation
library(rsample)
set.seed(1234)
mydata_split <- initial_split(imputed_data,
                              strata = cohort,  # maintain the same percentage of cases/controls
                              prop = 0.80)
mydata_training <- training(mydata_split)
mydata_test <- testing(mydata_split)

# 10-fold cross-validation: given the small sample size and the absence of
# parameter tuning, I will not create held-out data and will use the whole dataset.
set.seed(1234)
mydata_folds <- vfold_cv(imputed_data, v = 10)
mydata_folds
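The fold construction can be sketched in base R; this is only an illustration of what `vfold_cv` automates (the names `n` and `folds` are mine, not from the analysis):

```r
# Illustrative base-R sketch of 10-fold assignment (what vfold_cv automates)
set.seed(1234)
n <- 1331                                    # rows in imputed_data
folds <- sample(rep(1:10, length.out = n))   # shuffled, near-equal fold labels
table(folds)                                 # one fold of 134 rows, nine of 133
# Each CV round holds out one fold for assessment and trains on the other nine.
```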
2.5.3. Model Selection With Missing Data Imputation
To determine whether the imputation was performed properly without substantially changing model performance, I will compare two models: one including all variables and the other excluding the two variables with significant missingness. I could also compare against a model using complete cases, but because the sample sizes would differ too much between models, I chose this comparison instead. Along with the AUC-ROC value, the number of false negatives (which carries more clinical importance) will also be considered, and the model with better false-negative performance will be selected. Significant predictors in the two models will also be compared.
Model excluding the two variables with significant missingness
This model can be compared with the model including these two variables in 2.5.4.
imputed_data_drop <- imputed_data %>%
  select(-pps_sum, -suicidality_risk)
set.seed(1234)
mydata_split_drop <- initial_split(imputed_data_drop,
                                   strata = cohort,  # maintain the same percentage of cases/controls
                                   prop = 0.80)
mydata_training_drop <- training(mydata_split_drop)
mydata_test_drop <- testing(mydata_split_drop)
mydata_glm_drop <- glm(cohort ~ ., data = mydata_training_drop, family = binomial(logit))
summary(mydata_glm_drop)
Call:
glm(formula = cohort ~ ., family = binomial(logit), data = mydata_training_drop)
Coefficients:
Estimate Std. Error
(Intercept) 4.942656 1.499017
DEM002Female -0.856955 0.201204
DEM003 -0.064206 0.009429
DEM004Separated 0.937199 0.430508
DEM004Divorced 0.836902 0.272311
DEM004Never married 1.010768 0.260428
DEM004Widowed 0.592886 0.669202
DEM005One other person 0.271899 0.305707
DEM005More than one other person 0.521904 0.304877
DEM006Grade 7 to 12 -0.073755 1.181374
DEM006Graduated high school or high school equivalent -0.466416 1.153980
DEM006Part college -0.911793 1.157818
DEM006Graduated 2 years college -1.415609 1.180843
DEM006Graduated 4 years college -1.039536 1.179874
DEM006Part graduate/professional school -2.107400 1.357155
DEM006Completed graduate/professional school -0.709468 1.198047
DEM007No -0.234212 0.233332
DEM008Have just enough to get along -0.439306 0.216660
DEM008Are comfortable -0.888200 0.330163
pain_severity_sum -0.063186 0.017198
pain_interf_sum -0.007501 0.009782
dependence_levelLow -0.321873 0.470807
dependence_levelLow to Moderate 0.499488 0.470429
dependence_levelModerate 0.438853 0.436700
dependence_levelNon-smoker -1.415626 0.409711
Anxiety1 -0.101627 0.245515
Depression1 0.174846 0.248811
ca_sum 0.044102 0.015764
alcohol -0.707190 0.213333
SIDSSI 0.145705 0.047275
SSDSSI -0.070290 0.036055
ISDSSI -0.067315 0.038020
z value Pr(>|z|)
(Intercept) 3.297 0.000976 ***
DEM002Female -4.259 2.05e-05 ***
DEM003 -6.809 9.80e-12 ***
DEM004Separated 2.177 0.029483 *
DEM004Divorced 3.073 0.002117 **
DEM004Never married 3.881 0.000104 ***
DEM004Widowed 0.886 0.375639
DEM005One other person 0.889 0.373782
DEM005More than one other person 1.712 0.086924 .
DEM006Grade 7 to 12 -0.062 0.950219
DEM006Graduated high school or high school equivalent -0.404 0.686080
DEM006Part college -0.788 0.430983
DEM006Graduated 2 years college -1.199 0.230601
DEM006Graduated 4 years college -0.881 0.378287
DEM006Part graduate/professional school -1.553 0.120469
DEM006Completed graduate/professional school -0.592 0.553725
DEM007No -1.004 0.315488
DEM008Have just enough to get along -2.028 0.042598 *
DEM008Are comfortable -2.690 0.007141 **
pain_severity_sum -3.674 0.000239 ***
pain_interf_sum -0.767 0.443208
dependence_levelLow -0.684 0.494189
dependence_levelLow to Moderate 1.062 0.288339
dependence_levelModerate 1.005 0.314930
dependence_levelNon-smoker -3.455 0.000550 ***
Anxiety1 -0.414 0.678922
Depression1 0.703 0.482226
ca_sum 2.798 0.005149 **
alcohol -3.315 0.000917 ***
SIDSSI 3.082 0.002056 **
SSDSSI -1.950 0.051232 .
ISDSSI -1.771 0.076644 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1339.23 on 1063 degrees of freedom
Residual deviance: 724.28 on 1032 degrees of freedom
AIC: 788.28
Number of Fisher Scoring iterations: 6
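Because `glm` reports coefficients on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret clinically. A quick sketch using a few estimates copied from the summary above:

```r
# Exponentiate selected log-odds estimates from the summary above into odds ratios
coefs <- c(DEM002Female = -0.856955,   # female vs. male
           DEM003       = -0.064206,   # per additional year of age
           ca_sum       =  0.044102)   # per additional point of ca_sum
round(exp(coefs), 3)
# exp(-0.856955) is about 0.424: roughly 0.42 times the odds of being in the
# OUD cohort for females, holding the other predictors fixed.
```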
# Model building
lr_cls_spec_2 <- logistic_reg() |> set_engine("glm")

# Model fit
lr_cls_fit_2 <- lr_cls_spec_2 |> fit(cohort ~ ., data = mydata_training_drop)

# Prediction using the TEST dataset
mydata.lr.pred.values.test_2 <- bind_cols(
  truth = mydata_test_drop$cohort,
  predict(lr_cls_fit_2, mydata_test_drop),
  predict(lr_cls_fit_2, mydata_test_drop, type = "prob")
)
mydata.lr.pred.values.test_2
Result: Given the similarity in model performance, the model that includes the two variables is preferable, as these variables hold clinical significance. More importantly, the imputation process does not appear to negatively affect the model’s performance. In particular, the model with the two variables showed a slightly higher AUC-ROC (0.88655 vs. 0.88596) and fewer false-negative cases (inclusion model n = 28 vs. exclusion model n = 29). Clinically, minimizing false negatives is crucial in a predictive model. Therefore, further development of the algorithm will proceed using the imputed data with these two variables included.
2.5.4. Logistic regression
glm (Training/Test)
# General model, fit on the whole dataset
forwholeglm <- final_data %>% select(-SubjID)
whole_glm <- glm(cohort ~ ., data = forwholeglm, family = binomial(logit))
summary(whole_glm)
Call:
glm(formula = cohort ~ ., family = binomial(logit), data = forwholeglm)
Coefficients:
Estimate Std. Error
(Intercept) 6.305e+00 2.434e+00
DEM002Female -8.908e-01 2.561e-01
DEM003 -6.194e-02 1.246e-02
DEM004Separated 5.731e-01 6.136e-01
DEM004Divorced 3.985e-02 3.799e-01
DEM004Never married 4.745e-01 3.256e-01
DEM004Widowed -2.376e-01 7.551e-01
DEM005One other person -4.703e-01 4.046e-01
DEM005More than one other person -2.187e-01 4.111e-01
DEM006Grade 7 to 12 1.349e-01 2.160e+00
DEM006Graduated high school or high school equivalent -5.028e-02 2.139e+00
DEM006Part college -4.823e-01 2.138e+00
DEM006Graduated 2 years college -8.433e-01 2.155e+00
DEM006Graduated 4 years college 1.376e-01 2.149e+00
DEM006Part graduate/professional school -6.214e-01 2.266e+00
DEM006Completed graduate/professional school -2.230e-01 2.160e+00
DEM007No 1.324e-01 3.052e-01
DEM008Have just enough to get along -9.866e-02 2.796e-01
DEM008Are comfortable -1.135e+00 4.317e-01
pain_severity_sum -2.457e-02 2.214e-02
pain_interf_sum -2.181e-02 1.265e-02
dependence_levelLow -7.406e-01 5.965e-01
dependence_levelLow to Moderate 1.912e-01 5.847e-01
dependence_levelModerate 3.932e-01 5.426e-01
dependence_levelNon-smoker -1.420e+00 5.005e-01
Anxiety1 2.331e-01 3.254e-01
Depression1 1.413e-01 3.393e-01
ca_sum 3.054e-02 2.403e-02
pps_sum -9.971e-05 6.625e-03
suicidality_riskLow Risk -6.244e-01 3.303e-01
alcohol -6.094e-01 2.638e-01
SIDSSI 1.692e-01 5.788e-02
SSDSSI -1.355e-01 4.488e-02
ISDSSI -5.281e-02 4.639e-02
z value Pr(>|z|)
(Intercept) 2.590 0.009601 **
DEM002Female -3.479 0.000504 ***
DEM003 -4.971 6.66e-07 ***
DEM004Separated 0.934 0.350314
DEM004Divorced 0.105 0.916452
DEM004Never married 1.457 0.144996
DEM004Widowed -0.315 0.752988
DEM005One other person -1.162 0.245090
DEM005More than one other person -0.532 0.594660
DEM006Grade 7 to 12 0.062 0.950198
DEM006Graduated high school or high school equivalent -0.024 0.981249
DEM006Part college -0.226 0.821511
DEM006Graduated 2 years college -0.391 0.695617
DEM006Graduated 4 years college 0.064 0.948936
DEM006Part graduate/professional school -0.274 0.783887
DEM006Completed graduate/professional school -0.103 0.917754
DEM007No 0.434 0.664494
DEM008Have just enough to get along -0.353 0.724211
DEM008Are comfortable -2.628 0.008584 **
pain_severity_sum -1.110 0.267008
pain_interf_sum -1.724 0.084781 .
dependence_levelLow -1.242 0.214378
dependence_levelLow to Moderate 0.327 0.743639
dependence_levelModerate 0.725 0.468598
dependence_levelNon-smoker -2.837 0.004547 **
Anxiety1 0.716 0.473818
Depression1 0.416 0.677142
ca_sum 1.271 0.203667
pps_sum -0.015 0.987992
suicidality_riskLow Risk -1.890 0.058735 .
alcohol -2.310 0.020901 *
SIDSSI 2.924 0.003460 **
SSDSSI -3.020 0.002530 **
ISDSSI -1.138 0.254915
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 785.04 on 607 degrees of freedom
Residual deviance: 445.80 on 574 degrees of freedom
(723 observations deleted due to missingness)
AIC: 513.8
Number of Fisher Scoring iterations: 6
# Model building
lr_cls_spec <- logistic_reg() |> set_engine("glm")

# Model fit to training dataset
lr_cls_fit <- lr_cls_spec |> fit(cohort ~ ., data = mydata_training)

# Prediction using the testing dataset
mydata.lr.pred.values.test <- bind_cols(
  truth = mydata_test$cohort,
  predict(lr_cls_fit, mydata_test),
  predict(lr_cls_fit, mydata_test, type = "prob")
)
mydata.lr.pred.values.test
# Model building
lr_cls_spec_cv <- logistic_reg() |> set_engine("glm")

# Model fit to FULL dataset
lr_cls_fit_cv <- lr_cls_spec_cv |> fit(cohort ~ ., data = imputed_data)  # here, I'm using the full data

# Create a workflow() for fitting the glm
glm_wf <- workflow() |>
  add_model(lr_cls_spec_cv) |>
  add_formula(cohort ~ .)

# Cross-validation fitting
glm_fit_cv <- glm_wf |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

# Collect out-of-fold predictions into one tibble
mydata_glm_cv_preds <- collect_predictions(glm_fit_cv)

# Plot the ROC curve of the CV results
autoplot(roc_curve(mydata_glm_cv_preds, cohort, .pred_0))
# To see the performance of each fold
mydata_glm_cv_preds |> group_by(id) |> roc_auc(cohort, .pred_0)
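The per-fold `roc_auc` values have a useful probabilistic reading: AUC is the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control. A toy base-R illustration with made-up scores (not from this dataset):

```r
# Toy example: AUC as the rank statistic that roc_auc estimates (scores made up)
case_scores    <- c(0.9, 0.8, 0.6)  # hypothetical predicted P(OUD), cases
control_scores <- c(0.7, 0.4, 0.2)  # hypothetical predicted P(OUD), controls
wins <- outer(case_scores, control_scores, ">") +
        0.5 * outer(case_scores, control_scores, "==")  # ties count as half
auc <- mean(wins)  # 8 of 9 case/control pairs ranked correctly: AUC ~ 0.889
auc
```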
library(vip)
The following object is masked from 'package:utils':
    vi
# Model specification/building
rf_spec <- rand_forest(trees = 300,
                       min_n = 5) |>  # min number of data points in a node for further splitting
  set_engine("randomForest", importance = TRUE) |>
  set_mode("classification")

# Model fit
rf_fit <- rf_spec |> fit(cohort ~ ., data = mydata_training)
rf_fit
parsnip model object
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~300, nodesize = min_rows(~5, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 4
OOB estimate of error rate: 14.38%
Confusion matrix:
0 1 class.error
0 673 47 0.06527778
1 106 238 0.30813953
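The class errors in the OOB confusion matrix above can be recomputed directly, which makes the clinically important asymmetry explicit:

```r
# Recompute the OOB class errors from the confusion matrix printed above
conf <- matrix(c(673, 47,
                 106, 238),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("0", "1"), pred = c("0", "1")))
class_error <- 1 - diag(conf) / rowSums(conf)
round(class_error, 4)            # 0.0653 for controls, 0.3081 for cases
overall_oob <- (conf[1, 2] + conf[2, 1]) / sum(conf)
round(overall_oob, 4)            # 0.1438, matching the 14.38% OOB error rate
# The error rate for cases (missed OUD, i.e., false negatives) is far higher
# than for controls, which motivates the false-negative focus in model selection.
```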
# Prediction using the TEST dataset
rf_predicted <- bind_cols(
  truth = mydata_test$cohort,
  predict(rf_fit, mydata_test),
  predict(rf_fit, mydata_test, type = "prob")
)
rf_predicted
# I will then use cross-validation to re-estimate the predictive ability more fairly,
# using the same folds created for the glm model above ("mydata_folds").

# Workflow setup
rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)
# Use the workflow to fit the model on each fold of resampled data
rf_fit_cv <- rf_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

# One ROC plot for the cross-validation datasets
rf_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0) |>
  autoplot()
# Collect predictions for each fold
cv_predictions <- rf_fit_cv |> collect_predictions()

# To see the performance of each fold using ROC AUC
cv_predictions |> group_by(id) |> roc_auc(cohort, .pred_0)
# Calculate ROC data for each fold
fold_roc_data <- cv_predictions |> group_by(id) |> roc_curve(cohort, .pred_0)

# Calculate AUC for each fold
fold_auc <- cv_predictions |> group_by(id) |> roc_auc(cohort, .pred_0)
fold_auc
# Calculate the overall ROC curve
overall_roc_data <- cv_predictions |> roc_curve(cohort, .pred_0)

# Create labels with AUC for each fold
fold_auc_labels <- fold_auc |>
  mutate(label = paste0(id, " (AUC = ", round(.estimate, 3), ")"))

# Update the fold IDs in cv_predictions with the new labels
cv_predictions <- cv_predictions |>
  left_join(fold_auc_labels, by = "id") |>
  mutate(id = label)

# Plot ROC curves for each fold with AUC in the legend, overlaying the overall ROC curve
cv_predictions |>
  group_by(id) |>
  roc_curve(cohort, .pred_0) |>
  autoplot() +
  geom_line(data = overall_roc_data,
            aes(x = 1 - specificity, y = sensitivity),
            color = "black", size = 0.5) +
  labs(title = "ROC Curves for Each Fold with Overall ROC Curve",
       x = "False Positive Rate",
       y = "True Positive Rate",
       color = "Fold (AUC)") +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# To understand the variables that contribute most to the classification,
# I extract the importance scores using the vip package.
rf_fit |> extract_fit_engine() |> importance()
library(xgboost)
The following object is masked from 'package:dplyr':
    slice
# Model specification/building
bt_spec <- boost_tree(trees = 50,          # number of trees
                      tree_depth = 4) |>   # max depth of trees
  set_mode("classification") |>
  set_engine("xgboost")
bt_spec
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 50
tree_depth = 4
Computational engine: xgboost
# Recipe
bt_recipe <- recipe(cohort ~ ., data = mydata_training) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

# Workflow specification
bt_workflow <- workflow() |>
  add_model(bt_spec) |>
  add_recipe(bt_recipe)

# Model fit with training data
bt_fit <- fit(bt_workflow, data = mydata_training)
bt_fit
# Prediction using the TEST dataset
my.bt.pred <- bind_cols(
  truth = mydata_test$cohort,
  predict(bt_fit, mydata_test),
  predict(bt_fit, mydata_test, type = "prob")
)

# ROC AUC for the testing data
roc_auc(my.bt.pred, truth, .pred_0)
# Workflow setup
bt_workflow <- workflow() |>
  add_model(bt_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)
# Use the workflow to fit the model on each fold of resampled data
bt_fit_cv <- bt_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

# ROC plot for the cross-validation datasets
bt_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0) |>
  autoplot()
Three predictive models were implemented to predict opioid use disorder, and 10-fold cross-validation was conducted to evaluate model performance. All predictive models achieved satisfactory performance. When comparing ROC AUC values for the three algorithms under cross-validation, random forest had the highest value.
The RF model demonstrated the highest overall performance, with an F1-score of 0.8936, a recall of 0.9276, and the highest AUC of 0.8969, indicating superior discriminatory ability. The XGBoost model achieved the highest precision (0.8669) but a slightly lower AUC (0.8780) than the other models. The 10-fold ROC curves (light blue for logistic regression (LR), red for random forest (RF), and green for gradient boosted trees (GT)) highlight the models’ performance across different subsets of the data. While RF shows the best overall sensitivity across a range of false-positive rates, GT underperformed relative to RF and LR, as evidenced by its lower AUC despite visually appearing to follow RF’s trajectory. LR, although slightly less sensitive than RF in some regions, exhibited more consistent performance than GT. These findings indicate that RF offers the best overall predictive performance, while GT may be less reliable than LR and RF in this specific task.
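As a consistency check on these metrics, F1 is the harmonic mean of precision and recall, so the RF model's precision can be recovered from its reported F1 (0.8936) and recall (0.9276); the helper `f1` below is my own, not part of the analysis:

```r
# F1 is the harmonic mean of precision and recall
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)

# Invert the formula to recover the RF model's implied precision
implied_precision <- 0.8936 * 0.9276 / (2 * 0.9276 - 0.8936)
round(implied_precision, 3)               # ~0.862
round(f1(implied_precision, 0.9276), 4)   # recovers the reported F1 of 0.8936
```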
In the 10-fold cross-validation of the random forest algorithm, the AUC values across folds range from 0.869 to 0.927, indicating consistently strong performance with only minor variability. The fold curves are also closely clustered around the overall curve, which suggests the model’s stability.
This model has clinical and research implications. In practice, it enables early identification of high-risk patients, which can support clinical decision-making. Feature importance analysis also identified the top three factors contributing to the outcome; interventions should therefore be developed targeting these factors. Interestingly, these results align with certain findings in genetics, particularly the shared genetic risk factors between nicotine dependence and OUD. This study also highlights the role of mental defeat, a factor not yet incorporated into existing OUD models, which should be explored further in future research. This work also lays the foundation for future research by providing a strong predictive framework to support the development of more advanced machine learning models.
Limitation
The results of these analyses need to be interpreted with caution. Importantly, due to the genotypic objective of the parent study, all participants were Caucasian, limiting the generalizability of findings to more diverse populations. Second, a significant amount of missing data was observed in two key factors in this model. Despite this, these variables are clinically considered highly relevant to the outcome, and their imputation does not appear to have adversely affected the model’s performance, as discussed previously. Lastly, this model was developed from a relatively small sample. However, it was specifically tailored to chronic pain patients on long-term opioid therapy, a population at higher risk of developing OUD than the general population. Considering that the model incorporates various significant risk factors that are often excluded from existing machine learning models, it offers meaningful insights and potential contributions to the field.
Conclusion
Random forest achieved the best predictive performance, with an AUC-ROC of 0.89. Feature importance analysis identified nicotine dependence, age, and mental defeat as the top three features. This model examined predictive performance using the predetermined variables and showed satisfactory prediction performance; it can be a useful tool for clinicians to identify patients at risk of developing OUD and to implement early intervention for prevention. It not only offers a strong predictive framework but also lays the groundwork for more advanced machine learning models that make use of a broader range of variables, ultimately improving early identification and intervention strategies for at-risk patients.